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Abstract 

Advances in DNA sequencing technologies have made it possible to generate large amounts of sequence data 
very rapidly and at substantially lower cost than capillary sequencing. These new technologies have specific 
characteristics and limitations that require either consideration during project design, or which must be addressed 
during data analysis. Specialist skills, both at the laboratory and the computational stages of project design and 
analysis, are crucial to the generation of high quality data from these new platforms. The Illumina sequencers 
(including the Genome Analyzers l/ll/lle/llx and the new HiScan and HiSeq) represent a widely used platform 
providing parallel readout of several hundred million immobilized sequences using fluorescent-dye reversible- 
terminator chemistry. Sequencing library quality, sample handling, instrument settings and sequencing chemistry 
have a strong impact on sequencing run quality. The presence of adapter chimeras and adapter sequences at the 
end of short-insert molecules, as well as increased error rates and short read lengths complicate many 
computational analyses. We discuss here some of the factors that influence the frequency and severity of these 
problems and provide solutions for circumventing these. Further, we present a set of general principles for good 
analysis practice that enable problems with sequencing runs to be identified and dealt with. 



Background 

Recent advances in DNA sequencing have changed the 
field of genomics making it possible to generate giga- 
bases of genome and transcriptome sequence data at 
substantially lower cost than was possible just ten years 
ago http://www.genome.gov/sequencingcosts/. The rela- 
tive affordability of these high-throughput sequencers 
and the potential to generate large amounts of sequence 
data at lower cost means that scientists outside of tradi- 
tional sequencing facilities are now faced with the chal- 
lenges associated with design of large-scale projects and 
analysis of the data generated. This poses significant 
challenges for many groups since the inherent limita- 
tions of these platforms, and particular artifacts asso- 
ciated with sequences generated on these platforms, 
need to be understood and dealt with at various stages 
of the project including planning, sample preparation, 
run processing and downstream analyses. 

We present here an analysis of challenges encountered 
in using the Illumina sequencing instruments. A 
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thorough description of the Solexa/Illumina sequencing 
technology as well as a comparison to other platforms is 
available elsewhere [1-6]. We revisit here only the 
aspects relevant for project design and data analysis. 

Figure 1 shows the steps from the DNA sample prepara- 
tion to sequence read outs with quality scores. Indepen- 
dent of the actual application, Solexa/Illumina sequencing 
requires that the molecules to be determined are con- 
verted into special sequencing libraries. This is achieved 
by adding specific adapter sequences on both ends of frag- 
mented DNA molecules; allowing molecules to be ampli- 
fied, immobilized and primed for sequencing. For 
sequencing, typically double-stranded DNA libraries are 
provided and melted using sodium hydroxide to obtain 
single stranded molecules. These are then immobilized 
(hybridization and amplification on a solid phase; bridge 
amplification [1,7]) in one or more channels of an 8 -chan- 
nel flow cell. The immobilization and subsequent bridge- 
amplification creates randomly scattered clusters, consist- 
ing of more than one thousand copies of the original 
sequence in very close proximity to each other. After the 
bridge amplification step, cluster molecules are largely 
double stranded. One of the strands then has to be 
removed to obtain single stranded, identically oriented 
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Sample (a) Library preparation 
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(d) Reversible terminator chemistry 
and read out of incorporated dyes 
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Flowcell preparation 




(b) Immobilization 
and solid phase 
amplification 



(c) Linearization, 3' blocking, 
sequence primer hybridization 



(e) Image registration and intensity extraction (f) Base calling and quality scoring 



Cycle 1 



Cycle 2 



Cycle 3 



@SOLEXA-GA03_0001_PEi_SG : 5 : 1 : 1033 : 5267 

AGACAGACACAGAGNAAGACCCAGTCCGCCACACAGGCAAACTCA 

+SOLEXA-GA03_0001_PEi_SG : 5 : 1 : 1033 : 5267 

4— '-(/. 23/044! 51/+//. 400/-/1-62/.6021834///6 



Figure 1 lllumina sample preparation and sequencing. Illumina sequencing requires that a DNA sample (a) is converted into special 
sequencing libraries. This can be achieved by shearing DNA to a designated size and adding specific adapter sequences on both ends of the 
DNA molecules (b). These adapters allow molecules to be amplified and immobilized in one or more channels of an 8-channel flow cell (c). 
Immobilization and solid-phase amplification create randomly scattered clusters, consisting of a few thousand copies of the original molecule in 
very close proximity to each other. One of the DNA strands is removed to obtain single stranded, identically oriented copies, 3' ends of the DNA 
are blocked and a sequencing primer hybridized on the adapter sequences. Afterwards, the reversible terminator chemistry is performed (d). 
Here, four differently labeled nucleotides are provided and used for extension of the primers by DNA polymerases. The polymerase reaction 
terminates after the first base incorporation since the nucleotides used are not only labeled, but also 3'-blocked. After washing away free 
nucleotides, the nucleotides incorporated are readout by piece-wise imaging of the flow cell. Then, the terminator and fluorophore are removed 
and another incorporation cycle started. The four images are overlaid (registered) and light intensities extracted for each cluster and cycle using 
a cluster position template obtained from the first instrument cycles (e). Resulting intensity files serve as input for base calling, the conversion of 
intensity values into bases and quality scores (f). 



copies of the starting molecule. This is achieved by selec- 
tive cleavage of base modifications of oligonucleotides on 
the flowcell. After the free 3' ends of the DNA have been 
blocked, the copies can be sequenced by hybridizing a 
sequencing primer onto the adapter sequences and start- 
ing the reversible terminator chemistry. Here, four differ- 
ently labeled nucleotides are provided and used for 
extension of the sequencing primers by DNA polymerases. 
The DNA polymerase reaction terminates after the first 
base incorporation since the nucleotides used are not only 
labeled, but also S'-blocked (i.e. they carry a terminator 



group at the third carbon atom of the sugar, which pre- 
vents further extension). After free nucleotides are washed 
away, the nucleotides being incorporated are read by cap- 
turing the light signal of the fluorophore labels after laser 
excitation. Imaging of the flow cell is carried out in so- 
called tiles which are the units in which the flowcell is 
imaged and data processed. The terminator and fluoro- 
phore are then removed and another incorporation cycle 
started [1]. 

Initially, the number of sequencing cycles, and thereby 
the length of the sequence reads, was limited to 26 
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cycles because of steeply increasing sequencing error. 
Between 2008 and 2010 there were several technical 
updates to the Genome Analyzer (GA) platform includ- 
ing improvements in mechanics, chemistry and software. 
Even though sequencing error still increases with each 
cycle, up to 150 sequencing cycles are currently per- 
formed with reasonable error profiles (average error 
below 1%, and up to 10% in the final cycles). Further, 
flow cell cluster densities were increased from 5-12 mil- 
lion clusters to about 35-60 million clusters per lane 
(and twice that for HiSeq instruments; where clusters 
on the top and bottom of the flow cell are read). A 
technical update made sequencing of the reverse strand 
of each molecule possible. Using this "paired-end 
sequencing" approach for determining the reverse 
strand, doubles the amount of sequence data generated. 
Known insert size, and thereby a known distance separ- 
ating the paired reads obtained, provides additional 
information for later assembly or mapping [8]. This 
technical update of the Genome Analyzer in 2008, the 
Paired End (PE) module, also allowed the hybridization 
of further sequencing primers in the same strand orien- 
tation, making it possible to sequence a sample index (i. 
e. barcode) as part of the ligated adapter [9,10]. Such an 
index read allows for multiple samples to be sequenced 
in one lane (multiplexing). These can later be computa- 
tionally separated based on the sequence of their index. 

During progression of the sequencing run, or when 
images for all cycles have been collected (depending on 
the setup and version), the four images captured per tile 
are overlaid (registered) and light intensities are 
extracted for each cluster and cycle [1]. The resulting 
cluster position template is then aligned with images of 
all cycles and the intensities minus the surrounding 
background in the four different images extracted. 
Resulting intensity files serve as input for base calling - 
the conversion of intensity values into bases. Base call- 
ing on the Illumina platform is complicated by at least 
two effects: (1) a strong correlation of the A and C 
intensities as well as of the G and T intensities due to 
similar emission spectra of the fluorophores used and 
their limited separation by optical filters, and (2) depen- 
dence of the signal for a specific cycle on the signal of 
the cycles before and after, known as phasing and pre- 
phasing, respectively. Phasing and pre-phasing describe 
the loss of synchrony in the readout of the sequence 
copies of a cluster. Phasing is caused by incomplete 
removal of the 3' terminators and fluorophores as well 
as sequences in the cluster missing an incorporation 
cycle. Pre-phasing is caused by the incorporation of 
nucleotides without effective S'-blocking. The proportion 
of sequences in each cluster which are affected by phas- 
ing and pre-phasing increases with cycle number; ham- 
pering correct base identification [11-14]. 



From this whole process, the Illumina user typically 
obtains sequences and per base quality scores. The set 
of sequences for each lane is usually quality filtered and 
the user gets a summary report for judging run quality. 
Finally, the Illumina CASAVA package provides addi- 
tional tools and an interface to the visualization routines 
in Illumina's Genome Studio. Different commercial as 
well as free programs are available that replace some 
parts of the processing such as image analysis [15], base 
calling [11-15], quality assessment (e.g. TileQC [16] or 
FastQC [17]), mapping [18-21], as well as downstream 
data analysis and processing [8,22-25]. There is a large 
community of users and developers for the Illumina 
platform; the http://seqanswers.com website is an excel- 
lent resource when starting to explore the variety of 
programs available for analyzing the data generated. 

Results 

We present each stage of a sequencing project from the 
generation of sequencing libraries to the base-calling of 
the sequences. For each step we discuss the potential 
effects on data analysis and final data quality. Where 
possible we offer suggestions and guidelines on how to 
avoid specific artifacts that arise during sequencing. 

Sequencing libraries, minimum insert size and adapter 
artifacts 

The most important requirement for a DNA library to 
be sequenced on the Illumina platform is the presence 
of specific outer adapter sequences, complementary to 
the oligonucleotides on the flow cell used for cluster 
generation, the so-called "grafting sequences". As differ- 
ent sequencing primers can be used (see below), the rest 
of the library design is very flexible and various library 
preparation protocols with partially distinct adapter 
sequences are used for specific applications. Library 
adapters can be added by single strand ligation (e.g. Illu- 
mina small RNA protocol), double strand blunt-end 
ligation (e.g. for a multiplex protocol [10]), double 
strand overhang ligation (e.g. A-overhang for Illumina 
genomic library protocols, and restriction enzyme over- 
hangs in the Illumina DGE protocols), or by extension 
from overhanging primers (e.g. multiplex PCR or mole- 
cular inversion probes [26,27]). Each of these 
approaches has a different susceptibility to the creation 
of library adapter dimers, chimeric sequences and other 
library artifacts. Each therefore requires a different 
approach to enrich for only those molecules with cor- 
rectly added adapters, and to remove short/no insert 
molecules and molecules which are too long (> 800nt) 
from the library before sequencing. While short-insert 
molecules will, as described below, directly impact data 
analysis, longer molecules will perform differently in 
flow cell generation and generate more wide-spread and 
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less dense clusters. If not accounted for by modified 
cluster generation protocols, these will result in lower 
quality reads. 

Failure to perform an enrichment during library pre- 
paration has two potential effects: (i) the artifact 
sequences may have a negative impact on the image 
analysis and base-calling which are both challenged by 
an overrepresentation of one sequence population (see 
below) and (ii) sequencing of large numbers of such 
artifacts is uneconomical and lowers the potential num- 
ber of informative sequences that can be generated per 
run. Libraries prepared from small amounts of input 
material tend to suffer from a higher fraction of library 
artifacts due to the relative abundance of adapter oligo- 
nucleotides compared to insert molecules. 

It is possible to computationally post-process sequen- 
cing data in cases where enrichment has not been 



performed. Figure 2, exemplifies for the Illumina Nlalll 
DGE protocol (a protocol for digital gene expression tag 
profiling) that adapter chimeras might be created which 
are of comparable length as the targeted library mole- 
cules and thus may not be removed by selecting a speci- 
fic library insert-size (e.g. by gel length selection, silica 
column purification or Solid Phase Reversible Immobili- 
zation (SPRI) purification [28]). In this case, a program 
like TagDust [29] can be used with the original adapter 
and primer oligonucleotide sequences to identify such 
artifacts in a library (Figure 2B). This program can be 
either used to directly remove these sequences or, for a 
representative lane, its results can be clustered and the 
most frequent ones used with other software tools. 

Inappropriate size selection during library preparation 
may also complicate analysis due to partial sequencing 
of the adaptor at the sequence ends. Thus, when 
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S- B: Most frequent TagDust filtered sequences 

GEX Adapter 2.1 

TCGTATGCCGTCTTCTGCTTG 
35191 GTATGCCGTCTTCTGCTT 
4 7 33 AA- TCGTATGCCGTCTTCT 

310 9 TCGGAC-TCGTATGCCGTC 
2 9 63 G-TCGTATGCCGTCTTCTG 
2 8 75 TCGGACTGTAGAATCGTA 
2 818 ATGGC-TCGTATGCCGTCT 
2 33 9 AGGAG-TCGTATGCCGTCT 
2 3 07 T C - TCGTATGCCGTCTTCT 

2108 TCGGACTGTAGAACTCTT 
193 6 A-TCGTATGCCGTCTTCTG 
1886 AGGAGT-TCGTATGCCGTC 
18 8 0 CGTATGCCGTCTTCTGCT 
16 67 GTATGCCGTCTTCTTCTT 
152 7 AG- TCGTATGCCGTCTTCT 

150 9 CCAG-TCGTATGCCGTCTT 
13 66 GTGA-TCGTATGCCGTCTT 
12 3 8 GTG-TCGTATGCCGTCTTC 
12 0 9 TCGGACTGTAGA-TCGTAT 

TCGGACTGTAGAACTCTGAAC 

GEX Adapter 1.2 



Matches human Nlalll restriction 
site chr20:+1 86841 6-1 868436 

1907 GTTTCAGGAGTTTATTTT 

GEX Adapter 1.1 

ACAGGTTCAGAGTTCTACAG TCCGA CA TG 
12 68 GCCACCCTCTACAG-CCGA 



Figure 2 Adaptors and adaptor chimeras are a common sources of sequence artifacts Specific outer adapter sequences, complementary 
to the grafting sequences on the flow cell are essentially the only requirement for sequencing a DNA library on the Genome Analyzer platform. 
As different sequencing primers can be used, library design is very flexible and various protocols with partially distinct adapter sequences have 
been established. The Illumina Nlalll DGE tag protocol illustrated here (a protocol for digital gene expression tag profiling) uses short adapters 
which are not compatible with paired end sequencing and are added by overhang ligation (A). For this protocol the majority of adapter dimers 
are removed by a gel excision step after library preparation. However, the protocol may also create adapter chimeras with a length comparable 
to the targeted library molecules. The resulting chimera sequences also show the sequences required for cluster generation as well as the 
necessary priming site, causing them to be sequenced together with the real DGE tags. A program like TagDust [29] can be used with the 
original adapter and primer oligonucleotide sequences to identify such artifacts (B). Shown are the twenty most frequent identified artifacts from 
one lane with human DGE tags, as well as the oligosequences they might be based on. One of the 20 sequences seems to be a real DGE tag 
that was incorrectly identified as an artifact. 
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selecting for insert-size, it should be considered that 
current experimental methods generally do not provide 
precise length cutoffs. The lower cutoff selected should 
therefore be well-above the desired sequencing length. 
For sequence reads where part of the adapter sequence 
is included, the position in the sequence read at which 
the adapter sequence begins has to be identified and the 
read trimmed appropriately. Unfortunately, this is not 
part of the standard Illumina data processing and also 
non-trivial for short adapter fragments, especially given 
the increasing sequencing error at the end of reads. If 
reads are not filtered for known chimeras and trimmed 
for adapter sequences, these may interfere with map- 
ping/alignment and reads will either be incorrectly 
excluded or placed incorrectly. In both cases down- 
stream data analysis will be affected. 

In order to test how Illumina's ELAND mapper as 
well as the widely used mapping program BWA [20] are 
impacted by incipient adapter sequence, we simulated 
101-cycle reads of an Illumina Paired End genomic 
library with 10,000 reads for every adapter start point 
between 1 to 350nt and the error profile observed for 
an actual run of this length (Figure 3). Considering that 
both mapping programs implement very different 
approaches (seed alignment versus semi-global align- 
ment of the whole read respectively), the performance 
of Illuminas ELAND mapper is expected to be different 
from BWA. Since ELAND requires only a fixed seed in 
the beginning of the read (typically of 32nt length) adap- 
ters starting after this seed region should not affect 
ELAND's mapping. Indeed, ELAND maps 98% of all 
simulated reads of at least 30nt insert size (2nt of adap- 
ter sequence being compensated by 2 mismatches being 
allowed in the seed), while BWA only reports 98% suc- 
cessful mappings for reads with an insert size of at least 
97nt. More relevant for many analyses, however, is the 
number of mappings reported to be uniquely placed and 
whether they are mapped at the correct position in the 
genome. ELAND reports a uniquely placed 20nt-insert- 
size read, but it is placed incorrectly (as are all uniquely 
placed reads reported up to an insert size of 67nt). 
BWA reports the first three uniquely placed fragments 
(mapping quality above 20) for an insert size of 83nt (2 
of them are correctly placed). If we require that 98% of 
the reads are correctly placed, ELAND achieves this for 
insert sizes of 83nt and above (14nt of adapter), while 
BWA can only compensate with mismatches for 4nt of 
adapter sequence (97nt insert size). However, BWA pro- 
vides a lower total number of false positive placements 
due to the inclusion of adapter sequence (8490 vs. 
6308). Moreover, for an insert size of at the least read 
length, BWA reports 99.999% of uniquely placed reads 
(94.2% of all reported alignments) at the correct geno- 
mic positions, while ELAND only reports 98.757% of 



the uniquely placed reads (83.8% of all reported align- 
ments) at the correct genome coordinates. BWA there- 
fore provides a more accurate mapping of these reads 
for downstream analysis. 

While length selection and dimer removal are impor- 
tant for the cost-effective sequencing of a library and 
downstream data analysis, experimental methods to 
achieve these generally consume sample material and 
may bias molecule representation. It is therefore often 
only practical to apply a minimum of these purification 
steps in order to maintain library quantity and complex- 
ity. In such cases, downstream sequence filtering prior 
to data analysis becomes extremely important. 

Short-insert libraries in paired-end sequencing 
experiments 

When libraries containing inserts shorter than the sum 
of forward and reverse read cycles are created, these can 
be sequenced from both ends to obtain higher quality 
sequence information for the overlapping sequence part. 
For such paired-end reads the correct identification of 
the adapter is eased by maximizing autocorrelation of 
the two reads as well as requiring identical adapter start 
positions for both reads [30-32]. This strategy is more 
powerful than alignment(-like) approaches used for 
identifying adapter starts in single reads, which fre- 
quently remove sequence from the read ends that match 
the adapter by chance, or which do not identify real 
adapter sequence due to the higher sequencing error at 
the end of reads. Thus, for short insert libraries, paired 
end sequencing is preferable. As previously reported 
[30], the read merging performed for these short-insert 
libraries considerably decreases the number of errors 
and creates sequences reflecting the original outer mole- 
cule length (e.g. of interest for authenticity of ancient 
DNA samples [31]). Applying this merging approach to 
the simulated data set described above, but this time 
using both paired-end reads, we see a factor of 5 reduc- 
tion in the error rate of all merged sequences (average 
error of 0.24% reduces to 0.05%; Additional File 1). For 
sequences shorter or equal to the read length a reduc- 
tion by a factor of about 21 (0.146% to 0.007%) is 
observed. 

Library contamination 

Sample contamination during library preparation from 
other DNA/RNA sources might be an important issue 
for some types of analyses and applications. Contamina- 
tion may be introduced by the experimenter or may 
stem from lab chemicals and equipment. Library pre- 
parations starting from low amounts of sample DNA 
and protocols using single strand ligation procedures 
can be considered the most prone to contamination. 
While avoidance of contamination is the most desirable 
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Figure 3 Effects of adapter sequence inclusion on mapping. Untrimmed adapter sequence at the read ends can interfere with alignment/ 
mapping. We simulated 101 -cycle human genomic shotgun reads for an lllumina Paired End library with 10,000 reads for every adapter starting 
point between 1 to 350nt, and the error profile observed for an actual run of this length. On this data set, we tested how ELAND and BWA are 
affected by inclusion of adapter sequence: (A) ELAND requires only a fixed seed (here 32nt) in the beginning of the read. Adapters beginning 
after this seed region may therefore have no effect on the output. ELAND reports 98% successful mappings for all simulated reads of at least 
30nt insert size (2nt of adapter sequence being compensated by 2 mismatches allowed in the seed), BWA only reports 98% successful mappings 
for reads with an insert size of at least 97nt. (B) Frequently only uniquely placed molecules are considered in data analysis. ELAND reports the 
first uniquely placed fragment for 20nt insert size. BWA reports the first three uniquely placed fragments (mapping quality above 20) for an 
insert size of 83nt. (C) All uniquely placed reads reported by ELAND up to an insert length of 67nt are placed incorrectly (when comparing to 
the coordinates the sequence was extracted from), as is one of the 3 reported by BWA for an insert size of 83nt. When requiring 98% correct 
placements, ELAND handles up to 14nt of adapter (83nt insert size), while BWA can only compensate with mismatches for 4nt of adapter 
sequence (97nt insert size). (D) For analysis purposes, BWA shows the better performance due to the lower number of false positive placements. 
Moreover, for an insert size of at least the read length (i.e. no adapters interfering with the alignment), BWA reports 99.999% of uniquely placed 
reads (94.2% of all reported alignments) at the designated genomic positions, while ELAND only reports 98.757% of the uniquely placed reads 
(83.8% of all reported alignments) at the correct position. 
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approach, it has been suggested (e.g. [33]) that reads can 
be filtered by the alignment to the putative contaminant 
sequence before data analysis. However, such filtering 
may introduce biases in the data, especially if sequences 
are short and/or the evolutionary distance between con- 
taminant and sample is low. This is a frequent problem 
in ancient DNA studies of early modern humans, and 
Neandertals where contamination with even small 
amounts of modern DNA can quickly dominate sequen- 
cing output. Here the fraction of contamination can be 
deduced from informative sites (i.e. sites of known fixed 
differences between species/populations) and the frac- 
tion of contaminant molecules determined [31,32,34]. 
This ratio can then be used in statistical models during 
data analysis. If no informative sites are known, esti- 
mates of contamination may be obtained from biallelic 
or triallelic sites in haploid/diploid sequences. Hence, 
also the sex of the sample may be exploited by counting 
Y chromosomal alignments in female samples and 
determining x chromosomal heterozygosity for males 
[35,36]. 

Cross-contamination after library preparation (e.g. 
during preparation of the sequencing run) can be easily 
identified and filtered from the final sequencing data 
when sample-specific barcodes are included in the 
libraries and determined during sequencing [9,10]. A 
different problem may arise when pools of barcoded 
libraries are amplified simultaneously; in such experi- 
ments "jumping PCR" may cause barcodes to be trans- 
ferred between samples [37-40]. High error in the 
sequencing, amplification or synthesis of barcodes, espe- 
cially for those with limited sequence distance from one 
another, as well as barcode contamination during hand- 
ling or mixed clusters/small physical cluster distance on 
the flow cell can lead to sample misidentification. In 
projects for which data analysis is susceptible to any of 
these types of contamination, appropriate measures have 
to be taken to minimize contamination and to estimate 
its impact on analysis results. 

Alternative sequencing primers 

The sequencing primers used in sequencing of libraries 
constructed for different applications may differ. During 
flow cell preparation, different sequencing primers can 
be hybridized in each lane. However, in contrast to the 
first read prepared on the Cluster Station/cBot, it is not 
possible to use lane specific primers for the reverse read 
or for the index read prepared in the sequencing instru- 
ment. While it is technically possible to mix primers, 
this is generally not recommended due to potential mis- 
priming. As a result, for paired-end as well as indexed 
sequencing runs the same library type has to be loaded 
on all eight lanes of the flow cell. Thus, while functional 
custom sequencing libraries are easily designed, it is 



generally not advisable to handle several library types 
and sequencing primers simultaneously due to the 
increased risk of sequencing primer mix-ups in the pre- 
paration process and the potentially long waiting times 
for runs with appropriate primer configurations when 
small-scale projects have to be sequenced on a shared 
flow cell. Further, when projects require a control lane 
or control read spike-in (see corresponding section 
below), the use of different adapter sequences/sequen- 
cing primers means that matching control libraries are 
also required. 

Machine adjustment and run preparation 

The correct adjustment of the sequencing instrument is 
an important prerequisite for producing high quality 
sequencing data. Information on settings for each 
instrument and software version is available from the 
vendor's technical documentation. Individual instru- 
ments as well as sequencing kits and flow cells have var- 
iation. Therefore, correct instrument adjustment should 
be frequently checked and preparation of a sequencing 
run done with much attention to detail. Besides from 
attention to the liquids and optics of the instrument, the 
correct functioning of thermal elements and cooling 
devices are equally important for high quality runs. 

When loading the chemistry and flow cell, all connec- 
tors should be checked for leaks, the correct priming of 
pumps with the reagents validated and long waiting 
times avoided. This is to prevent air bubbles in the 
pump, tubing and finally on the flow cell, which could 
cover parts of the images (Figure 4A) or reduce chemis- 
try efficiency, due to smaller effective volumes and 
incomplete coverage of the inner flow cell surface. Leaks 
most frequently originate at the flow cell connectors 
(GA) or flow cell gaskets (HiSeq). While gaskets are 
replaced with each flow cell, the flow cell connectors of 
GA instruments may wear away or break. Loading 
reagents is simplified for HiSeq instruments, which store 
all reagents in two racks within the instrument - one for 
sequencing reagents and one for paired end (PE) 
reagents. Special funnel caps have to be used with the 
reagents to allow sippers to be lowered into the liquids. 
Bending of the sippers may cause reduced flow from the 
reagent. The HiSeq can only hold ten tubes for PE and 
multiplex chemistry (the GA holds up to 25 tubes). 
Thus, three of these ten positions are shared for PE pre- 
paration and multiplex chemistry on HiSeq, requiring 
more user interaction for multiplex runs and increasing 
the risk of reduced reagent flow and air in the instru- 
ment tubing. 

In cases where several sequencing kits are required 
(typically more than 36 cycles for GA and 50 cycles 
for HiSeq), all sequencing kits required for sequencing 
reads on the same strand should be thoroughly mixed 
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Figure 4 Origin of image artifacts. Correct instrument adjustment is an important prerequisite for producing high quality sequencing data. 
Preparation and start of a sequencing run has to be done with careful attention to avoid or identify the following instrumentation artifacts: (A) 
Air bubbles, caused by leaks, insufficient priming of reagent pumps and long waiting times. Bubbles can obscure parts of the images or reduce 
chemistry efficiency. (B) Particles in the sequencing chemistry (e.g. crystals from an unfiltered incorporation mix) frequently result in image 
artifacts. (C) Incorrect adjustment of stage flatness and stage tilt can cause distortions, i.e. parts of the image are sharp while the rest is out of 
focus. A similar effect limited to tiles at the flow cell edges, can originate from liquids covering the flow cell surface. (D) Reflections in the GA 
instrument can cause variation in cluster brightness, like the commonly observed band of bright clusters in column 2 of lane 8. (E) If the position 
of laser excitement is not in sync with imaging (footprint) on GA instrument, a black straight band can be observed at the edges of multiple 
tiles (partially with comb like slots). (F) If this effect is limited to tiles at the flow cell edges, oil coverage is insufficient. 

V J 



before splitting them by tube volume in order to pre- 
vent later problems with base calling due to different 
fluorophore intensities/chemistry performance. On GA 
and HiSeq, sequencing reagents can only be primed 
through the flow cell, while the PE/multiplex reagents 
can be primed using a separate priming pump. When 
refilling the sequencing reagent during the run, air 
bubbles in the hoses must be prevented as no addi- 
tional priming can be performed. Further, the incor- 
poration mix should be filtered and centrifuged before 
adding the polymerase and loading. This reduces 
chemistry crystals and lint artifacts on the images. 
Particles on the flow cell may result in excessively 
strong light signals (Figure 4B) covering large areas of 
the imaged flow cell. These may obscure actual clus- 
ters or cause cluster-like structures which are then 
detected during the cluster identification process (Fig- 
ure 5A). 

After performing the first base incorporation, the cor- 
rect adjustment of the horizontal positioning (Z-value) 
of the flow cell stage to the focus layer(s) as well as the 
flow cell tilt should be checked with a first cycle report 
before the actual first cycle imaging. For Genome Analy- 
zer instruments, the complete illumination of tiles (foot- 
print and oil application), the adjustment of the focus 
laser, the maximum focus range and stage tilt should be 
checked in addition. Some of the necessary adjustments 
can be done directly by a skilled lab technician, if a pro- 
blem is correctly identified. 



Problems with the stage flatness can be identified if 
image distortions, i.e. only part of the imaged tile is 
sharp while the rest seems out of focus (Figure 4C), are 
seen on multiple tiles and not limited to the edges of 
the flow cell. If this effect is limited to tiles at the flow 
cell edges, liquids (including oil in the case of a GA 
instrument) covering the flow cell surface could be a 
more reasonable explanation. While in the first case the 
adjustment of the instrument stage by an Illumina tech- 
nician is required, in the latter case, the flow cell has to 
be removed, cleaned and reinserted into the instrument 
(otherwise the liquid will be spread to more tiles over 
the course of the run). Commonly, a band of brighter 
clusters is observed on the right side in the second track 
of lane 8 for GA instruments (Figure 4D). This is prob- 
ably caused by a reflection of laser light on the right 
flow cell edge and does not appear to cause problems 
for data analyses. 

Flow cell tilt is measured automatically with current 
software versions; if a too high value is determined, a 
wrong alignment of the flow cell in the instrument is a 
likely source. However, an uneven outer flow cell edge 
may also cause this problem; in this case, the flow cell 
tilt has to be manually set by averaging out the uneven 
edge. In case the tilt is set incorrectly, tiles will not be 
positioned correctly within the lane boundaries. If a 
black straight band is observed for a GA instrument on 
one of the four edges of multiple tiles (partially comb- 
like slots can be observed, Figure 4E) the position of 
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A: Clusters identified from image artifacts 




4 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG 
1 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 
1 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATATATAAATTA 
1 AAAAAAAAAAAAACAAAAAACAAAAAAAAAACAAACAAAACAACAAATAA 
1 AAAAAAAATATTTAATTATTTTTATTTATAATTTTTTTGTTTTTTGTTTT 
1 AAACAAACCACACAAACAAAAAAACACAACAAAACAACACCACCACCCAA 
1 ATTCTATTTAATACAAATAAAATATCAATTTAAAACTACACTATACATAA 
1 CAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACA 
1 CAAATATATTTATATTTATTTTTTTATTTAATTTTTATATTTTTATTTAT 
. . . 103 non-library sequences from lint in total 



B: Sequence complexity of image artifacts 
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Figure 5 Image artifacts can generate false sequences. Cluster identification can identify crystals, dust and lint particles as well as other flow 
cell features as sequence clusters (A). Indicated are 103 non-library sequences originating from a lint particle that has been observed in a library 
that was sequenced with a three base pair tag ('GAC) in the beginning of each read. In this case, non-library sequences could therefore be 
distinguished based on these first three bases. The fraction of such artifact clusters is increased for low loading density and low intensity runs. A 
sequence entropy filter^^-I^) 10 ^^)) j s efficient for removing the majority of these sequences (82.52% for a cutoff of 0.85), but also 
removes non-artifact sequences (B) - as indicated in the figure, 0.01% of the human reference genome (GRCh37/hg19). For 375' tagged libraries 
or indexed sequencing libraries, filtering for the index/tag is therefore superior to base composition/sequence entropy filters for removing such 
sequencing artifacts. 



laser excitement may not be in complete synchroniza- 
tion with imaging (footprint). In this case the laser spot 
can be corrected by two adjustable screws in the GA 
instrument. If this effect is limited to tiles at the flow 
cell edges, oil coverage below the flow cell is not suffi- 
cient and additional oil has to be applied (Figure 4F). 

GA focus calibration reports should be checked for 
continuous x and Y values. Discontinuities in these 
values are frequently caused by confusion of the focus 
laser spot with its reflection and, like error messages of 



the spot being close to the image edge, these are the 
result of an incorrect focus laser adjustment. If the 
focus laser is not readjusted by an Illumina technician, 
incorrectly focused tiles will be obtained. High values 
for the maximum focus range in the first cycle report 
might hint at incorrect flow cell alignment which causes 
bending of the flow cell. In this case, Z-values typically 
decrease considerably from the middle of the flow cell 
towards both ends. However, if Z-values decrease or 
increase monotonically from top to bottom of the flow 
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cell the stage's tilt can be adjusted on GA instruments 
using precision mechanics screws. If not readjusted, the 
maximum focus range will be exceeded during sequen- 
cing and incorrectly focused clusters will result. 

If low intensities are observed in the first cycle report 
(e.g. due to long handling times), the primer hybridiza- 
tion should be repeated. This step can be performed 
with a sodium hydroxide wash and a hybridization pro- 
tocol either in a Cluster Station/cbot or with a PE mod- 
ule/HiSeq instrument directly in the instrument, as long 
as identical sequencing primers are being used for all 
lanes. The typical intensity range is highly dependent on 
the exact chemistry and image analysis software version 
and is available from the corresponding vendor 
documentation. 

It is advisable to temporarily store run images (only 
possible for the Genome Analyzers) or to permanently 
store reduced size thumbnails of the run image (possible 
for all platforms). When problems with a run are 
observed direct observation of images frequently pro- 
vides very useful information for troubleshooting. 

Image analysis 

The images for each of the four fluorophores, for all 
tiles per lane in each cycle have to be overlaid (regis- 
tered) and the light intensities extracted for each cluster 
and cycle. In cases where all reads, or a vast majority of 
the reads, have identical initial sequences, the image 
analysis software removes clusters as it considers these 
to originate from one initial DNA molecule and there- 
fore uninformative. This reduces the overall yield (Addi- 
tional File 2) from a sequencing run by 10% - 30% 
depending on the loading density and software version. 
Libraries produced from restriction-digested molecules 
or libraries with tag/barcode sequences added on the 
outer edge are frequently affected in this way. Starting 
the off-instrument image analysis (using the built-in 
Firecrest module) with a user-defined nr parameter, 
which sets the number of cycles used for cluster identifi- 
cation, can be used as a work-a-round. This can only be 
achieved in cases where it is possible to save the full run 
images off the instrument. The default value of this 
parameter depends on the analysis pipeline version 
(below vl.3: 1 and not configurable; vl.3 to vl.5: 2 
cycles, vl.6: 4 cycles, vl.8: 5 cycles). However, the design 
of project specific primers should be preferred over 
changing image analysis parameters, as the data transfer 
and offline analysis of images requires an additional 
investment in money and time. It is possible to change 
the number of detection cycles used by the instrument 
software (RTA). This requires additional disk space for 
on-instrument storage, and will increase the waiting 
time for calculation of the cluster position template. 
Changing the number of cycles used for cluster 



identification may also increase the fraction of artifacts 
incorrectly identified as clusters (see sequencing artifacts 
section). Further, when the majority of sequences are 
identical in the first cycles, problems with base calling 
may occur (see base calling section). 

The optimal number of clusters per mm 2 of the lane 
largely varies depending on the instrument version and 
the library being sequenced. With the software version 
RTA/OLB vl.8, a complex library without base compo- 
sition biases can be loaded with 600,000-700,000 per 
mm 2 , while a low complexity library should be loaded in 
the range of 500,000-600,000 per mm 2 . Precise values 
for each instrument and software version should be 
obtained from the vendors technical support. Differ- 
ences between low and high complexity libraries, are 
caused by an increased background signal (lower base 
qualities) and cluster tracking issues (N bases) which 
occur when many reads have the same base in a cycle 
and thus end up on the same image. If cluster densities 
are reduced, the background signal from clusters in 
close proximity is reduced which leads to higher inten- 
sity over noise values being measured. Loading the cor- 
rect amount of library DNA to obtain these designated 
cluster counts requires a precise quantification by qPCR 
or high-resolution chip-based capillary electrophoresis 
[41]. A stable and accurate library quantification proce- 
dure is therefore also one of the main prerequisites for 
performing high-quality sequencing runs. 

Sequencing artifacts and sequence quality 

The random dispersion cluster generation process per- 
formed allows for high loading densities but also com- 
plicates the identification of cluster positions from 
images. Image analysis and cluster identification algo- 
rithms used for this purpose often identify sequencing 
chemistry crystals, dust and lint particles as well as 
other flow cell features as sequence clusters (Figure 5A). 
The fraction of such artifact clusters is increased for low 
cluster densities (as the number of these artifacts does 
not necessarily increase with cluster density) and for 
low intensity runs. Low cluster intensities may have 
multiple causes including: (1) reduced cluster growth 
during bridge amplification, (2) clusters which are wide- 
spread (e.g. due to large library insert sizes), (3) ineffi- 
cient sequencing primer hybridization, (4) degraded/ 
bleached fluorophores, poor performance of the poly- 
merases due to production/storage, handling issues or 
bad scanning buffer, or (5) increased background signal. 
When low intensities are observed for the first time, pri- 
mer hybridization and first base incorporation should be 
repeated to exclude the most frequent causes. 

If the cluster identification software identifies sequen- 
cing chemistry crystals, dust and lint particles as clus- 
ters, the resulting sequences are typically of low 
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sequence complexity (i.e. consisting of long stretches of 
identical bases; Figure 5A) and may be partially also of 
low quality (Additional File 3). However, such sequences 
are typically not completely removed by signal purity/ 
quality filters. Further, freely moving crystals/lint may 
also appear in different positions in later cycles and 
obscure the signal of regular clusters. Depending on 
their size, these may even cover a larger fraction of a 
tile and thereby prohibit the correct read out of many 
clusters in one or several cycles (Figure 4B). In combi- 
nation with air bubbles (Figure 4A; caused by leaks or 
depletion of some reagent), these impermanent features 
are a frequent source of missing base calls (Ns) and 
sequencing error. The number of these fixed and mova- 
ble artifacts can be reduced by a clean sequencing set- 
up and the steps described above. In extreme cases, the 
exclusion of complete tiles from analysis during post- 
processing should be considered [16]. 

The low sequence complexity of reads from most of 
these mis-identified artifacts can be exploited by a 
sequence entropy filter or another base composition/ 
base frequency filter. For indexed sequencing libraries 
[9,10], such clusters are efficiently removed by an index 
sequence filter step. The same applies for libraries with 
tag sequences. Filtering for index and tag sequences 
should be considered superior, as other filters may also 
remove non-artifact sequences of low complexity (Figure 
5B). 

Further, random cluster dispersal results in a wide 
range of inter-cluster distances, causing different sus- 
ceptibility of individual clusters to neighboring signals. 
At the most extreme two distinct sequence populations 
can be completely merged in one mixed cluster, result- 
ing in the detection of a mixture of signals from the dif- 
ferent sequences. Depending on the ratio of the two 
sequences and their sequence similarity, the resulting 
sequence can be close to random with low overall base 
quality scores, or may be more similar to one of the ori- 
ginal molecules and show low base quality scores/higher 
error rate for only some positions in the read. This 
effect of cluster distance on signal purity causes sequen- 
cing errors to be non-randomly distributed, i.e. the frac- 
tion of reads with two errors is not equal to the squared 
fraction of reads with one error, but is considerably 
higher (Additional File 4). Hence, some clusters accu- 
mulate error disproportionately due to their close proxi- 
mity to another sequence cluster. Such clusters can be 
identified by a high frequency of low quality bases. The 
default Illumina signal purity filter called 'chastity' 
requires that for the first twelve cycles (in later versions 
of the analysis pipeline the first 25 cycles, allowing one 
outlier) corrected intensities for the bases called are 1.5 
times higher than the next highest base intensity. How- 
ever, a simple quality-score-based filter (which is by 



design highly correlated with signal purity) applied over 
all reads and not only the first bases of the run is prefer- 
able. We have found a lower base quality score cutoff of 
15 and allowing one outlier every 20 cycles to be an 
effective filter. 

Standard base calling and sequence composition 

The Illumina base caller uses a model-based approach 
for the conversion of intensity values into bases. The 
run-specific parameters of this model (so-called cross- 
talk matrix and phasing/pre-phasing values) are deter- 
mined from the first few cycles of each read. The cross- 
talk matrix is typically estimated from cycle 2, phasing 
and pre-phasing values from the first 20 cycles. This 
estimate is often incorrect when libraries have an unba- 
lanced base composition in this part of the read. Such 
unbalanced base composition is common when a 
restriction site or some tag sequence is present in either 
the forward or the reverse read (in case of paired-end 
sequencing). The only read type for which this para- 
meter estimation is not used is the index read. For the 
index, parameters calculated for the preceding read are 
applied. Therefore, for at least one lane in each run the 
base composition should be balanced over the thou- 
sands of clusters per tile, or a separate control lane has 
to be sequenced for estimating these base-calling 
parameters. 

This control lane library is not limited to the com- 
monly used 0 x 174 genome; however the choice should 
be limited to a high complexity shot-gun library from an 
organism with close to 50% GC content, to account for 
assumptions in the parameter estimation process 
[1,13,14]. A genomic shot-gun library from most species 
can be used for this purpose. mRNA sequencing 
libraries have been suggested as a valid replacement for 
the control lane, however mRNA libraries (prepared 
from the standard Illumina-protocol using double strand 
ligation protocol) show a biased base composition in the 
first twelve bases of the reads [42] and are therefore no 
longer recommended. 

Quality control, read spike-in and alternative base callers 

In addition to the estimation of base-calling parameters, 
a control lane also provides useful statistics for sequen- 
cing quality. For this purpose the addition of a spike-in 
<p x 174 sequences in every lane is highly recommended 
for runs on the Illumina platform, even in cases where a 
high complexity shot-gun library lane is available for 
parameter estimation. Given a known high quality con- 
trol library, the obtained per lane statistics for this 
library can then be compared, even between different 
runs. The choice of 0 x 174 as a quality control is again 
arbitrary; however a control sample should have the fol- 
lowing features: (1) small genome with no similarity to 



Kircher et al. BMC Genomics 201 1, 12:382 
http://www.biomedcentral.eom/1 471 -21 64/1 2/382 



Page 12 of 14 



the actual library sequenced, (2) completely known gen- 
ome sequence of the exact sample to determine error 
development and (3) high complexity and balanced base 
composition to study error patterns. If condition (1) is 
fulfilled, the spike-in can be performed without the need 
for multiplex sequencing in order to later separate the 
control reads from other library molecules. The inclu- 
sion of <fiX spike-in in each lane rather than in a sepa- 
rate control lane is also a vendor-recommended 
procedure for HiSeq sequencing. 

A fraction of less than 1% control reads is sufficient for 
creating these quality statistics for all lanes and even facili- 
tates the use of a reference-based base calling approach, 
such as AltaCyclic [11] and Ibis [14], which increases the 
base calling accuracy. Despite differences in sequencing 
chemistry, run parameters and overall run quality, the 
application of an alternative base caller typically yields a 
reduction in error rate of about 20% (Additional File 5). 
When turning off automatic parameter estimation for the 
Illumina base caller and using default values from runs of 
comparable sequencing chemistry, this low fraction of 
control reads in combination with an alternative base 
caller allows for the omission of a dedicated control lane 
for libraries with unbalanced base composition. 

Frequently projects use sequence data generated on 
different sequencing platforms, with varying versions of 
the sequencing chemistry and instrument software, or 
data produced in different facilities. This creates a need 
for assuring data quality and consistency. Quality score 
recalibration based on sample alignments to a reference 
genome has been identified as one solution to this pro- 
blem [43]. Currently the most widely applied algorithm 
is part of BROAD Institute's Genome Analysis Toolkit 
(GATK) [44]. However, quality score calibration based 
on alignments with some divergence to the actual sam- 
ple is problematic, especially if the divergence to the 
reference varies between samples. This could cause a 
biased correction of quality scores and fewer SNP calls 
for the samples with higher divergence (which is why 
the tool in GATK allows for masking of known SNP 
positions). If the quality scores calibrated on a reference 
alignment are used in an inter-species comparison and 
genome quality or species diversity varies (as is com- 
monly the case), then qualities obtained from such cali- 
brations will have a species bias (i.e. scores will be lower 
for the species with lower reference genome quality or 
higher diversity). Therefore, only spike-in control reads 
should be used for calibration of quality scores (as is 
implicitly the case for the Ibis base caller [14]). 

Conclusions 

The Illumina platform provides users with the ability to 
construct application-specific sequencing libraries using 
a variety of lab protocols and possibly different 



sequencing primers. These libraries may differ in the 
observed artifacts like adapter dimers and chimeras, the 
susceptibility to contamination from external DNA/ 
RNA sources as well as the ability of establishing a 
designated library insert size. 

What is perhaps not widely realized is that the Illu- 
mina software does not handle artifact sequences, nor 
does it filter or trim adapters. Thus, some fraction of 
insert-adapter-chimeras or pure adapter dimers may end 
up in the final data analysis; where they may cause false 
alignments or, due to unsuccessful alignment, exclude 
short-insert-size molecules from analysis. Even when 
explicitly included, the identification of adapter 
sequence and adapter chimeras is not trivial. It is ham- 
pered by reads where only a few bases of the adapter 
are read and by the higher error rates at the end of 
reads. For paired end reads the correct identification of 
adapters is assisted by maximizing autocorrelation of the 
two reads and expecting equal adapter start points. For 
short insert libraries, paired end sequencing is therefore 
preferable. Further, the merging performed for short 
insert libraries allows for considerably reduced error 
rates in the paired-end-consensus sequences. 

In addition to creating a high-quality sequencing library 
and quantifying it in order to calculate the correct loading 
density, the correct adjustment of the sequencer, handling, 
air bubbles as well as particles in the sequencing chemistry 
have a considerable impact on the quality of a run. These 
instrument adjustments are greatly simplified on the 
HiSeq instrument, however, on the earlier Genome Analy- 
zers reflections, uneven application of oil and an imper- 
fectly adjusted machine cause problems in data quality. 
Particles like chemistry crystals, dust and lint can cause 
pseudo sequence signals which then result in the inclusion 
of reads which do not correspond to molecules from the 
sequenced library. Tagging or indexing provides users 
with the ability to filter reads for bona fide library mole- 
cules and should be preferred over sequence complexity 
based methods which efficiently remove most of these 
artifact sequences, but which may introduce a bias due to 
the removal of real low complexity sequences. 

Indices/tags placed in the beginning of reads may 
reduce sequencing throughput and increase error due to 
problems introduced in image data analysis and base 
calling. Performed as separate reads [9,10], as intended 
by Illumina, the error profile of the actual reads is not 
altered and multiplexing allows for the optimal usage of 
the increasing sequencing throughput. 

Spike-in controls included in all lanes facilitates the 
assessment of run quality for individual lanes and whole 
runs. We suggest that improved base callers should be 
considered to obtain sequences of increased quality and 
that the PHRED-like base quality scores should be used 
for quality-based filtering based on the complete read 
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rather than only on the beginning of the read. Quality 
score based filters are equally suited for filtering clusters 
accumulating error due to their close proximity to other 
sequence clusters and may also remove reads affected 
by freely moving artifacts in later sequencing cycles. 

We summarize the most important principles as follows: 
(1) Filter all Illumina-supplied buffers and perform pre- 
run checks for leaks and instrument adjustment. (2) 
Check quality statistics as well as images for artifacts and 
the correct adjustment of the machine. (3) Filter sequence 
data for library artifacts like adapters and chimeras and 
examine the data for traces of contamination. (4) Remove 
artificial clusters by filtering for sequence complexity or, if 
possible, filter for tags/indexes. (5) Filter low quality reads 
based on quality scores of complete sequences. (6) Use 
alternative base callers to obtain the maximum yield of 
high quality sequences from a run. These simple guide- 
lines enable the identification and elimination of most of 
the problems commonly encountered in sequencing runs 
done on the Illumina sequencing systems. 

Methods 

Simulated data 

Simulated reads were obtained by extracting ten thou- 
sand 350nt long sequences, not containing N characters, 
from all chromosomes/contigs of at least 1 Mb in the 
human hgl9/GRCh37 assembly. These sequences were 
then trimmed for the different molecule lengths/adapter 
start positions and paired end reads created. For 
sequences below the read length of 101, the forward and 
the reverse read adapter sequences were added (forward: 
AGATCGGAAGAGCGGTTCAGCAG GAATGC CGA- 
GACCGATCTCGTATGCCGTCTTCTGCTTG, reverse: 
AGATCGGAAGAGC GGTTCAGCAGGAAT GCCGA- 
GACCGATCTCGTATGCCGTCTTCTGCTTG) and if 
required extended by further A bases at the end. On the 
resulting 3.5 million reads, the error profile extracted 
from the control reads of a 2 x 101 cycle version 4 
sequencing chemistry run was applied by randomly 
mutating bases at the observed rate for each position and 
using two different approaches for simulating quality 
scores: (1) the quality score was set to the average error 
observed for the specific base-type in this cycle (i.e. all 
Adenines at the same position in the read have the same 
quality score) or (2) error-informative quality scores were 
created by adding a random number between 0 and 10 
(uniform sampling) to the average quality score of this 
base when the correct base was simulated and subtract- 
ing a random number between 0 and 10 (uniform sam- 
pling) when a wrong base was simulated. 

Merging of paired end reads 

The two reads from each cluster were merged providing 
the adapter sequences and requiring at least an lint 



overlap between the two reads. Reads were merged by 
sliding the reverse complement sequence of the reverse 
read along the forward read and determining the quality 
score adjusted sequence identity of forward and reverse 
read for the different adapter start positions. Reads were 
merged if the highest observed sequence identity in the 
read overlap was at least 90%. In the overlapping 
sequence, quality scores were combined (assuming equal 
likelihood for non-observed bases) and the base with the 
highest base quality score was called. 

Additional material 



Additional File 1: Merging of paired end reads efficiently removes 
adapter sequence for short insert libraries and increases read 
accuracy. Shown is the average sequencing error of the two simulated 
raw reads (black) in comparison to the sequencing error remaining after 
read merging for different adapter start points. The development is 
shown for two different types of simulated quality scores (red and 
green). In red, the quality score is the average error observed for the 
specific base-type in this cycle (i.e. all Adenines at this position in the 
read have the same quality score), while in green an error-informative 
quality score was simulated. For this type of quality score a random 
number between 0 and 10 (uniform sampling) was added to the 
average quality score of this base when the correct base was simulated 
and a random number between 0 and 10 (uniform sampling) was 
subtracted if a wrong base was simulated. The average reduction of error 
(starting from 0.244%) is 1.93 x (0.126%) for the position-dependent 
quality scores and 4.98 x (0.049%) for the error-informative quality scores. 
For sequences shorter or equal to read length (5-1 01 nt) a reduction of 
error (0.146%) by a factor of 1.62 x (0.090%) and 20.88 x (0.007%) is 
observed, respectively. Sequences are required to have more than lOnt 
overlap for merging and merged sequences below 5nt are discarded as 
adapter dimers by the program. 

Additional File 2: Reduction of the number of clusters identified in 
tile images due to identical tag sequences. If all or the vast majority 
sequences start identical in the first read of a sequencing run, image 
analysis will consider a higher fraction of the clusters as being grown 
into each other and remove them. This effect is for example observed if 
libraries are made from restriction digested molecules or if tag/barcode 
sequences are added on the outer molecule edges and read in the first 
read. Changing parameters for an image offline analysis (Firecrest 
module) can be used as a work-a-round. The figure table shows cluster 
counts as well as a section of the image of the same tile in cycle 1 and 
4 for a run from the Neandertal Genome project (Green et al: Science 
2010) 080902_BIOLAB29_Run_PE51_1 in which the tag 'GAC was read in 
the beginning of the first read. Cluster counts were obtained from IPAR 
v1.01 image analysis (cluster identification based only on the first cycle of 
the run) and the results for a version of the Firecrest v1.9.5 algorithm, in 
which cluster identification was done in cycle 4. 

Additional File 3: Quality score distribution of artifact reads largely 
overlaps with the quality score distribution of regular reads. 

Sequences resulting from crystals, dust and lint particles as well as other 
flow cell features are typically of low complexity (Additional File 2) but 
only partially of low quality. Plotted is the quality score frequency 
distribution (PHRED-scale, Ibis base caller) for all reads matching the 
'GAC library tag in the beginning of the read (black, n = 557,466,159 
bases from 10,930,709 reads) as well as all sequences not matching the 
tag sequence and its one base pair substitutions (red, n = 3,481,668 
bases from 68,268 reads). The data was obtained from lane 5 of the 
080902_BIOLAB29_Run PE51J run from the Neandertal Genome project 
(Green et al: Science 2010). 

Additional File 4: Non-random distribution of sequencing error 
across sequencing clusters. Random cluster generation results in a 
wide range of inter-cluster distances, causing sequencing error to be 
non-randomly distributed across clusters. The fraction of reads with two 
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errors is not equal to the squared fraction of reads with one error. 
Shown are the observed rates for reads with 1 to 5 errors for different 
lllumina Genome Analyzer data sets (solid lines) presented as test data 
sets for the Ibis base caller (Kircher et al: Genome Biology 2009) and the 
expected rates when extrapolating from the fraction of molecules with 
one error (dashed line). 

Additional File 5: Reduction in sequencing error when using the 
Ibis base caller for different instrument chemistries. Alternative base 
callers significantly reduce error rate and thereby increase the output of 
usable reads. The Ibis base caller (Kircher et al. Genome Biology 2009) 
has a wide support for different instrument and software versions, as 
well as for single read, paired-end read and multiplex sequencing runs. It 
is based on training sequencing cycle-specific machine learning models 
from a training data set, like for example a 0 x 174 spike-in control. 
Based on this data, also quality scores are adjusted for each run and are 
therefore comparable between sequencing runs and libraries without 
further normalization. 
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